Abstract

Precision-recall (PR) curves and the areas 
under them are widely used to summarize 
machine learning results, especially for data 
sets exhibiting class skew. They are often 
used analogously to ROC curves and the area 
under ROC curves. It is known that PR 
curves vary as class skew changes. What was 
not recognized before this paper is that there 
is a region of PR space that is completely 
unachievable, and the size of this region de- 
pends only on the skew. This paper precisely 
characterizes the size of that region and dis- 
cusses its implications for empirical evalua- 
tion methodology in machine learning. 

1. Introduction 

Precision-recall (PR) curves are a common way to eval- 
uate the performance of a machine learning algorithm. 
PR curves plot the proportion of positively labeled
examples that are truly positive (precision) as a
function of the proportion of positives that are
correctly classified (recall). In particular, PR
analysis is preferred to ROC analysis when there is a 
large skew in the class distribution. In this situation, 
even a relatively low false positive rate can produce a 
large number of false positives and hence a low
precision (Davis & Goadrich, 2006). Many applications are
characterized by a large skew in the class distribution.
In information retrieval (IR), only a few documents
are relevant to a given query. In medical diagnoses,
only a small proportion of the population has a
specific disease at any given time. In relational learning,
only a small fraction of the possible groundings of a
relation are true in a database.

Appearing in Proceedings of the 29th International
Conference on Machine Learning, Edinburgh, Scotland, UK, 2012.
Copyright 2012 by the author(s)/owner(s).

The area under the precision-recall curve (AUCPR) of- 
ten serves as a summary statistic when comparing the 
performance of different algorithms. For example, IR 
systems are frequently judged by their mean average 
precision, or MAP (not to be confused with the same 
acronym for "maximum a posteriori"), which is an 
approximation of the mean AUCPR over the queries 
(Manning et al., 2008). Similarly, AUCPR often serves
as an evaluation criterion for statistical relational
learning (SRL) (Kok & Domingos, 2010; Davis et al., 2005;
Sutskever et al., 2010; Mihalkova & Mooney, 2007) and
information extraction (IE) (Ling & Weld, 2010;
Goadrich et al., 2006). Additionally, some algorithms, such
as SVM-MAP (Yue et al., 2007) and SAYU (Davis
et al., 2005), explicitly optimize the AUCPR of the
learned model.

There is a growing body of work that analyzes the 
properties of PR curves (Davis & Goadrich, 2006; 
Clémençon & Vayatis, 2009). Still, PR curves and
AUCPR are frequently treated as a simple substitute 
in skewed domains for ROC curves and area under the 
ROC curve (AUCROC), despite the known differences 
between PR and ROC curves. These differences in- 
clude that for a given ROC curve the corresponding 
PR curve varies with class skew (Davis & Goadrich, 
2006). A related, but previously unrecognized,
distinction between the two types of curves is that, while
any point in ROC space is achievable, not every point
in PR space is achievable. That is, for a given data
set it is possible to construct a confusion matrix that
corresponds to any (false positive rate, true positive
rate) pair, but it is not possible to do this for every
(recall, precision) pair.¹

We show that this distinction between ROC space and 
PR space has major implications for the use of PR 
curves and AUCPR in machine learning. The foremost 
is that the unachievable points define a minimum PR 
curve. The area under the minimum PR curve con- 
stitutes a portion of AUCPR that any algorithm, no 
matter how poor, is guaranteed to obtain "for free." 
Figure 1 illustrates the phenomenon. Interestingly, we 
prove that the size of the unachievable region is only a 
function of class skew and has a simple, closed form. 

The unachievable region can influence algorithm
evaluation and even behavior in many ways. Even for
evaluations using F1 score, which consider only a single
point in PR space, the unachievable region has subtle
implications. When averaging AUCPR over multiple
tasks (e.g., SRL target predicates or IR queries), the 
area under the minimum PR curve alone for a non- 
skewed task may outweigh the total area for all other 
tasks. A similar effect can occur when the folds used 
for cross-validation do not have the same skew. Down- 
sampling that changes the skew will also change the 
minimum PR curve. In algorithms that explicitly op- 
timize AUCPR or MAP during training, algorithm be- 
havior can change substantially with a change in skew. 
These undesirable effects of the unachievable region 
can be at least partially offset with straightforward 
modifications to AUCPR, which we describe. 

2. Achievable and Unachievable Points 
in PR Space 

We first precisely define the notion of an achievable 
point in PR space. Then we provide an intuitive exam- 
ple to illustrate the concept of an unachievable point. 
Finally, in Theorems 1 and 2 we present our central 
theoretical contributions that formalize the notion of 
the unachievable region in PR space. 

We assume familiarity with precision, recall, and
confusion matrices (see Davis and Goadrich (2006) for an
overview). We use p for precision, r for recall, and
tp, fp, fn, tn for the number of true positives, false
positives, false negatives, and true negatives, respectively.

Figure 1. Minimum PR curve and random guessing curve
at a skew of 1 positive for every 2 negative examples.

¹To be strictly true in ROC space, fractional counts for
tp, fp, fn, tn must be allowed. The fractional counts can
be considered integer counts in an expanded data set.

Consider a data set D with n = pos + neg examples,
where pos is the number of positive examples and neg
is the number of negative examples. A valid confusion
matrix for D is a tuple (tp, fp, fn, tn) such that
tp, fp, fn, tn ≥ 0, tp + fn = pos, and fp + tn = neg.
We use π = pos/n, the proportion of examples that are
positive, to quantify the skew of D. Following convention,
highly skewed refers to π near 0 and non- or less
skewed to π near 0.5.

Definition 1. For a data set D, an achievable point
in PR space is a point (r, p) such that there exists a
valid confusion matrix with recall r and precision p.

2.1. Unachievable Points in PR Space 

One can easily show that, like in ROC space, each valid 
confusion matrix, where tp > 0, defines a single and 
unique point in PR space. In PR space, both recall 
and precision depend on the tp cell of the confusion 
matrix, in contrast to the true positive rate and false 
positive rate used in ROC space. This dependence, 
together with the fact that a specific data set contains 
a fixed number of negative and positive examples, im- 
poses limitations on what precisions are possible for a 
particular recall. 

To illustrate this effect, consider a data set with pos =
100 and neg = 200. Table 1(a) shows a valid confusion
matrix with r = 0.2 and p = 0.2. Consider holding
precision constant while increasing recall. Obtaining
r = 0.4 is possible with tp = 40 and fn = 60. Notice
that keeping p = 0.2 requires increasing fp from 80 to
160. With a fixed number of negative examples in the
data set, increases in fp cannot continue indefinitely.
For this data set, r = 0.5 with p = 0.2 is possible by
using all negatives as false positives (so tn = 0).
However, maintaining p = 0.2 for any r > 0.5 is impossible.
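Table 1. (a) Valid confusion matrix with r = 0.2 and p =
0.2 and (b) invalid confusion matrix attempting to obtain
r = 0.6 and p = 0.2. Cell values follow from pos = 100,
neg = 200, and the stated r and p.

(a) Valid
                     Actual Pos   Actual Neg
    Predicted Pos        20           80
    Predicted Neg        80          120
    Total               100          200

(b) Invalid
                     Actual Pos   Actual Neg
    Predicted Pos        60          240
    Predicted Neg        40          -40
    Total               100          200
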
Table 1(b) illustrates an attempted confusion matrix
with r = 0.6 and p = 0.2. Achieving p = 0.2 at this
recall requires fp > neg. This forces tn < 0 and makes
the confusion matrix invalid.
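
The example above can be checked mechanically. The following is a minimal sketch, not part of the original analysis: the helper names `confusion_matrix` and `is_valid` are hypothetical, and exact fractions are used so that the boundary case tn = 0 is not disturbed by rounding.

```python
from fractions import Fraction


def confusion_matrix(r, p, pos, neg):
    """Return the (tp, fp, fn, tn) implied by recall r and precision p > 0."""
    r, p = Fraction(r), Fraction(p)
    tp = r * pos                 # recall = tp / pos
    fp = tp * (1 - p) / p        # precision = tp / (tp + fp)
    fn = pos - tp                # tp + fn = pos
    tn = neg - fp                # fp + tn = neg
    return tp, fp, fn, tn


def is_valid(matrix):
    """A confusion matrix is valid only if every cell is non-negative."""
    return all(cell >= 0 for cell in matrix)


# Hold p = 0.2 fixed and raise recall on the pos = 100, neg = 200 data set.
for r in ("0.2", "0.5", "0.6"):
    m = confusion_matrix(r, "0.2", pos=100, neg=200)
    print(r, [int(c) for c in m], is_valid(m))
```

At r = 0.6 the implied tn is negative, which is the invalid case described in the text.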

The following theorem formalizes this restriction on 
achievable points in PR space. 

Theorem 1. Precision (p) and recall (r) must satisfy

$$p \ge \frac{\pi r}{1 - \pi + \pi r}, \tag{1}$$

where π is the skew.



Proof. Starting from the definition of precision,

$$p = \frac{tp}{tp + fp} \ge \frac{tp}{tp + (1 - \pi)n} = \frac{r \pi n}{r \pi n + (1 - \pi)n},$$

since the false positives cannot be greater than the
number of negatives, tp = rπn from the definition of
recall, and we can reasonably assume the data set is
non-empty (n > 0) so the ns cancel out. Thus

$$p \ge \frac{\pi r}{1 - \pi + \pi r}.$$

If a point in PR space satisfies Eq. (1), we say it is
achievable. Note that a point's achievability depends 
solely on the skew and not on a data set's size. Thus 
we often refer to achievability in terms of the skew and 
not in reference to any particular data set. 
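
Because achievability depends only on the skew, the test reduces to a one-line comparison. The sketch below is illustrative only; `min_precision` and `is_achievable` are hypothetical names, and the small tolerance `tol` is an assumption added to absorb floating-point error on the boundary.

```python
def min_precision(r, pi):
    """Lower bound on precision at recall r for skew pi, from Eq. (1)."""
    return pi * r / (1 - pi + pi * r)


def is_achievable(r, p, pi, tol=1e-12):
    """True if the PR point (r, p) satisfies Eq. (1) for skew pi."""
    return p >= min_precision(r, pi) - tol


pi = 100 / 300  # skew of the pos = 100, neg = 200 example
for r in (0.2, 0.5, 0.6):
    print(r, round(min_precision(r, pi), 4), is_achievable(r, 0.2, pi))
```

Only the ratio pos/n enters the computation, so a data set ten times larger with the same skew gives the same answers.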

2.2. Unachievable Region in PR Space 

Theorem 1 gives a constraint that each achievable 
point in PR space must satisfy. For a given skew, there 
are many points that are unachievable, and we refer to 
this collection of points as the unachievable region of 
PR space. This subsection studies the properties of 
the unachievable region. 

Eq. (1) makes no assumptions about a model's
performance. Consider a model that gives the worst
possible ranking where every negative example is ranked
ahead of every positive example. Building a PR curve
based on this ranking means placing one PR point at
(0, 0) and a second PR point at (1, π). Davis and
Goadrich (2006) provide the correct method for
interpolating between points in PR space; interpolation is
non-linear in PR space but is linear between the
corresponding points in ROC space. Interpolating between
the two known points gives intermediate points with
recall of $r_i = i/pos$ and precision of
$p_i = \frac{\pi r_i}{1 - \pi + \pi r_i}$ for $0 \le i \le pos$.
This is the equality case from Theorem 1,
so Eq. (1) is a tight lower bound on precision. We
call the curve produced by this ranking the minimum
PR curve because it lies on the boundary between the
achievable and unachievable regions of PR space. For
a given skew, all achievable points are on or above the
minimum PR curve.

Figure 2. Minimum PR curves for several values of π.
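
The worst-case ranking can be simulated directly. The sketch below is an illustration under the assumption neg > 0 (so the first point is well defined); `worst_ranking_pr_points` is a hypothetical helper. It labels all negatives as positive first, then adds positives one at a time, and checks that every resulting point meets Eq. (1) with equality.

```python
def worst_ranking_pr_points(pos, neg):
    """PR points after all negatives, then i positives, are labeled positive."""
    points = []
    for i in range(pos + 1):
        tp, fp = i, neg
        recall = tp / pos
        precision = tp / (tp + fp)
        points.append((recall, precision))
    return points


pos, neg = 100, 200
pi = pos / (pos + neg)
curve = worst_ranking_pr_points(pos, neg)
for r, p in curve:
    bound = pi * r / (1 - pi + pi * r)   # right-hand side of Eq. (1)
    assert abs(p - bound) < 1e-12        # equality: the bound is tight
print(curve[0], curve[-1])
```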

The minimum PR curve has an interesting implication 
for AUCPR and average precision. Any model must 
produce a PR curve that lies above the minimum PR 
curve. Thus, the AUCPR score includes the size of the 
unachievable region "for free." In the following theo- 
rem, we provide a closed form solution for calculating 
the area of the unachievable region. 

Theorem 2. The area of the unachievable region in
PR space and the minimum AUCPR, for skew π, is

$$\mathrm{AUCPR}_{\min} = 1 + \frac{(1 - \pi)\ln(1 - \pi)}{\pi}.$$

Proof. Since Eq. (1) gives a lower bound for the
precision at a particular recall, the unachievable area is the
area below the curve $f(r) = \frac{\pi r}{1 - \pi + \pi r}$:

$$\mathrm{AUCPR}_{\min} = \int_0^1 \frac{\pi r}{1 - \pi + \pi r}\,dr = \left[ r - \frac{1 - \pi}{\pi}\ln(1 - \pi + \pi r) \right]_0^1 = 1 + \frac{(1 - \pi)\ln(1 - \pi)}{\pi}.$$

See Figure 3 for AUCPRmin at different skews.
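
As a sanity check on the closed form, the area under the minimum PR curve can also be integrated numerically. This is a sketch only: `aucpr_min` and `aucpr_min_numeric` are hypothetical names, and the midpoint rule with 100,000 steps is an arbitrary choice.

```python
import math


def aucpr_min(pi):
    """Closed form from Theorem 2, valid for 0 < pi < 1."""
    return 1 + (1 - pi) * math.log(1 - pi) / pi


def aucpr_min_numeric(pi, steps=100_000):
    """Midpoint-rule integral of pi*r / (1 - pi + pi*r) over r in [0, 1]."""
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        r = (k + 0.5) * h
        total += pi * r / (1 - pi + pi * r)
    return total * h


for pi in (0.01, 0.1, 0.33, 0.5):
    print(pi, round(aucpr_min(pi), 4), round(aucpr_min_numeric(pi), 4))
```

The two columns should agree to the printed precision, and the value grows as π moves from 0.01 toward 0.5.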

Similar to AUCPR, Eq. (1) also defines a minimum 
for average precision (AP). Average precision is the 
mean precision after correctly labeling each positive 
example in the ranking, so the minimum takes the form 
of a discrete summation. Unlike AUCPR, which is 
calculated from interpolated curves, the minimum AP 
depends on the number of positive examples because 
that controls the number of terms in the summation. 

Theorem 3. The minimum AP, for pos and neg positive
and negative examples, respectively, is

$$\mathrm{AP}_{\min} = \frac{1}{pos} \sum_{i=1}^{pos} \frac{i}{i + neg}.$$

This precisely captures the natural intuition that the
worst AP involves labeling all negative examples as
positive before starting to label the positives.
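
The dependence of the minimum AP on data set size can be seen by holding the skew fixed and scaling pos and neg together. The sketch below assumes the summation form implied by the worst-case ranking (precision i/(i + neg) after the i-th positive); `ap_min` and `aucpr_min` are hypothetical helper names.

```python
import math


def ap_min(pos, neg):
    """Minimum average precision: all negatives are ranked before any positive."""
    return sum(i / (i + neg) for i in range(1, pos + 1)) / pos


def aucpr_min(pi):
    """Closed-form minimum AUCPR for skew 0 < pi < 1 (Theorem 2)."""
    return 1 + (1 - pi) * math.log(1 - pi) / pi


# Same skew (1 positive per 2 negatives), increasing data set size.
for scale in (1, 10, 100, 1000):
    pos, neg = scale, 2 * scale
    print(pos, neg, round(ap_min(pos, neg), 4), round(aucpr_min(pos / (pos + neg)), 4))
```

In this sketch the minimum AP stays above the minimum AUCPR and approaches it as the number of positives grows, since the sum is a right-endpoint approximation of the integral in Theorem 2.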

The existence of the minimum AUCPR and mini- 
mum AP can affect the qualitative interpretation of a 
model's performance. For example, changing the skew 
of a data set from 0.01 to 0.5 (e.g., by subsampling the 
negative examples (Natarajan et al., 2011; Sutskever 
et al., 2010)) increases the minimum AUCPR by ap- 
proximately 0.3. This leads to an automatic jump of 
0.3 in AUCPR simply by changing the data set and 
with absolutely no change to the learning algorithm. 

Since the majority of the unachievable region is at
higher recalls, the effect of AUCPRmin becomes more
pronounced when restricting the area calculation to
high levels of recall. Calculating AUCPR for recalls
above a threshold is frequently done due to the high
variance of precision at low recall or because the
learning problem requires high recall solutions (e.g., medical
domains such as breast cancer risk prediction).
Corollary 4 gives the formula for computing AUCPRmin
when the area is calculated over a restricted range of
recalls. See Figure 3 for minimum AUCPR when
calculating area over restricted recall.

Figure 3. Minimum AUCPR versus π for area calculated
over recall in [0,1] (entire PR curve), [0.5,1], and [0.8,1].

Corollary 4. For calculation of AUCPR over recalls
in [a, b] where 0 ≤ a < b ≤ 1,

$$\mathrm{AUCPR}_{\min} = b - a + \frac{1 - \pi}{\pi} \ln\left( \frac{\pi(a - 1) + 1}{\pi(b - 1) + 1} \right).$$
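
To see how restricting recall magnifies the unachievable region, the restricted-range formula can be evaluated as a fraction of the maximum attainable area b - a. This is a sketch with a hypothetical function name, `aucpr_min_range`, and an arbitrary choice of skew π = 0.5.

```python
import math


def aucpr_min_range(pi, a=0.0, b=1.0):
    """Minimum AUCPR over recalls in [a, b] for skew 0 < pi < 1 (Corollary 4)."""
    ratio = (pi * (a - 1) + 1) / (pi * (b - 1) + 1)
    return (b - a) + (1 - pi) / pi * math.log(ratio)


pi = 0.5
for a, b in ((0.0, 1.0), (0.5, 1.0), (0.8, 1.0)):
    area = aucpr_min_range(pi, a, b)
    # Share of the maximum possible area (b - a) that is unachievable.
    print((a, b), round(area, 4), round(area / (b - a), 4))
```

With a = 0 and b = 1 the expression reduces to the closed form of Theorem 2.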



3. PR Space Metrics that Account for 
Unachievable Region 

The unachievable region represents a lower bound on 
AUCPR and it is important to develop evaluation met- 
rics that account for this. We believe that any metric 
A' that replaces AUCPR should satisfy at least the 
following two properties. First, A' should relate to 
AUCPR. Assume AUCPR was used to estimate the
performance of classifiers C_1, ..., C_n on a single test
set test_D. If AUCPR(C_i, test_D) > AUCPR(C_j, test_D), then
A'(C_i, test_D) > A'(C_j, test_D), as test set test_D's skew
affects each model equally. Note that this property
may not be appropriate or desirable when aggregat- 
ing scores across multiple test sets, as done in cross 
validation, because each test set may have a differ- 
ent skew. Second, A' should have the same range for 
every data set, regardless of skew. This is necessary, 
though not sufficient, to achieve meaningful compar- 
isons across data sets. AUCPR does not satisfy the 
second requirement because, as shown in Theorem 2, 
its range depends on the data set's skew. 
We propose the normalized area under the PR curve
(AUCNPR). From AUCPR, we subtract the minimum
AUCPR, so the worst ranking has a score of 0. We
then normalize so the best ranking has a score of 1.

$$\mathrm{AUCNPR} = \frac{\mathrm{AUCPR} - \mathrm{AUCPR}_{\min}}{\mathrm{AUCPR}_{\max} - \mathrm{AUCPR}_{\min}}$$

where AUCPRmax = 1 when calculating area under
the entire PR curve and AUCPRmax = b - a when
restricting recall to a ≤ r ≤ b.

Regardless of skew, the best possible classifier will 
have an AUCNPR of 1 and the worst possible clas- 
sifier will have an AUCNPR of 0. AUCNPR also pre- 
serves the ordering of algorithms on the same test set 
since AUCPRmax and AUCPRmin are constant for 
the same data set. Thus, AUCNPR satisfies our pro- 
posed requirements for a replacement of AUCPR. Fur- 
thermore, by accounting for the unachievable region, 
it makes comparisons between data sets with different 
skews more meaningful than AUCPR. 
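
A minimal sketch of the normalization follows. The function names `aucpr_min` and `aucnpr` are hypothetical, and the input AUCPR is assumed to have been computed elsewhere over the same recall range [a, b].

```python
import math


def aucpr_min(pi, a=0.0, b=1.0):
    """Minimum AUCPR over recalls in [a, b] for skew 0 < pi < 1 (Corollary 4)."""
    ratio = (pi * (a - 1) + 1) / (pi * (b - 1) + 1)
    return (b - a) + (1 - pi) / pi * math.log(ratio)


def aucnpr(aucpr, pi, a=0.0, b=1.0):
    """Normalize AUCPR so the worst ranking scores 0 and the best scores 1."""
    lo = aucpr_min(pi, a, b)
    hi = b - a                      # AUCPRmax for the recall range [a, b]
    return (aucpr - lo) / (hi - lo)


pi = 0.3
print(aucnpr(aucpr_min(pi), pi), aucnpr(0.6, pi), aucnpr(1.0, pi))
```

Since the transformation is affine with a positive slope for a fixed skew, it cannot reorder classifiers evaluated on the same test set.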

An alternative to AUCNPR would be to normalize 
based on the AUCPR for random guessing, which is 
simply π. This has two drawbacks. First, the range of
scores depends on the skew, and therefore is not con- 
sistent across different data sets. Second, it can result 
in a negative score if an algorithm performs worse than 
random guessing, which seems counter-intuitive for an 
area under a curve. 

A discussion of degenerate data sets with π = 0 or π =
1, where AUCPRmin and AUCNPR are undefined, is
in our technical report (Boyd et al., 2012).

4. Discussion and Recommendations 

We believe all practitioners using evaluation scores 
based on PR space (e.g., PR curves, AUCPR, AP, F1)
should be cognizant of the unachievable region and 
how it may affect their analysis. 

Visually inspecting the PR curve or looking at an 
AUCPR score often gives an intuitive sense for the 
quality of an algorithm or difficulty of a task or data 
set. If the skew is extremely large, the effect of the very 
small unachievable region is negligible on PR analysis. 
However, there are many instances where the skew
is closer to 0.5 and the unachievable area is not
insignificant. With π = 0.1, AUCPRmin ≈ 0.05, and
it increases as π approaches 0.5. AUCPR is used in
many applications where π > 0.1 (Hu et al., 2009;
Sonnenburg et al., 2006; Liu & Shriberg, 2007). Thus
a general awareness of the unachievable region and its
relationship to skew is important when casually
comparing or inspecting PR curves and AUCPR scores. A
simple recommendation that will make the unachievable
region's impact on results clear is to always show
the minimum PR curve on PR curve plots.

Next, we discuss several specific situations where the 
unachievable region is highly relevant. 

4.1. Aggregation for Cross-Validation

In cross validation, stratification typically allows dif- 
ferent folds to have similar skews. However, partic- 
ularly in relational domains, this is not always the 
case. In relational domains, stratification must con- 
sider fold membership constraints imposed by links be- 
tween objects that, if violated, would bias the results 
of cross validation. For example, consider the bioinfor- 
matics task of protein secondary structure prediction. 
Putting amino acids from the same protein in different 
folds has two drawbacks. First, it could bias the re- 
sults as information about the same protein is in both 
the train and test set. Second, it does not properly 
simulate the ultimate goal of predicting the structure 
of entirely novel proteins. Links between examples oc- 
cur in most relational domains, and placing all linked 
items in the same fold can lead to substantial variation 
in the skew of the folds. Since the different skews yield 
different AUCPRmin, care must be taken when aggre- 
gating results to create a single summary statistic of 
an algorithm's performance. 

Cross validation assumes that each fold is sampled 
from the same underlying distribution. Even if the 
skew varies across folds, the merged data set is the 
best estimate of the underlying distribution and thus 
the overall skew. Ideally, aggregate descriptions, like a 
PR curve or AUCPR, should be calculated on a single, 
merged data set. Merging directly compares probabil- 
ity estimates for examples in different folds and as- 
sumes that the models are calibrated. Unfortunately, 
this is rarely a primary goal of machine learning and 
learned models tend to be poorly calibrated (Forman 
& Scholz, 2010). 

With uncalibrated models, the most common practice
is to average the results from each fold. For AUCPR, 
the summary score is the mean of the AUCPR from 
each fold. For a PR curve, vertical averaging of the in- 
dividual PR curves from each fold provides a summary 
curve. In both cases, averaging fails to account for any 
difference in the unachievable regions that arise due 
to variations in class skew. As shown in Theorem 2, 
the range of possible AUCPR values varies according 
to a fold's skew. Similarly, when vertically averaging 
PR curves, a particular recall level will have varying 
ranges of potential precision values for each fold if the 
folds have different skews. Even a single fold, which 
has much higher precision values due to a substantially
lower skew, can cause a higher vertically averaged PR 
curve because of its larger unachievable region. Failing 
to account for fold-by-fold variation in skew can lead 
to overly optimistic assessments when using straight- 
forward averaging. 

We recommend averaging AUCNPR instead of 
AUCPR when evaluating area under the curve. Aver- 
aging AUCNPR, which has the same range regardless 
of skew, helps reduce (but not eliminate) skew's effect 
compared to averaging AUCPR. A similar normaliza- 
tion approach for summarizing the PR curve leads to a 
non-linear transformation of PR space that can change
the area under the curves in unexpected ways. An 
effective method for generating a summary PR curve 
that preserves measures of area in a satisfying way and 
accounts for the unachievable region would be useful 
and is a promising area of future research. 

4.2. Aggregation among Different Tasks 

Machine learning algorithms are commonly evaluated 
on several different tasks. This setting differs from 
cross-validation because each task is not assumed to 
have the same underlying distribution. While the 
tasks may be unrelated (Tang et al., 2009), often they
come from the same domain. For example, the tasks 
could be the truth values of different predicates in a 
relational domain (Kok & Domingos, 2010; Mihalkova 
& Mooney, 2007) or different queries in an IR setting 
(Manning et al., 2008). Often, researchers report a 
single, aggregate score by averaging the results across 
the different tasks. However, the tasks can potentially 
have very different skews, and hence different mini- 
mum AUCPR. Therefore, averaging AUCNPR scores, 
which (somewhat) control for skew, is preferred to av- 
eraging AUCPR. 

In SRL, researchers frequently evaluate algorithms by
reporting the average AUCPR over a variety of tasks
in a single data set (Mihalkova & Mooney, 2007; Kok
& Domingos, 2010). As a case study, consider the
commonly used IMDB data set. Here, the task is to predict
the probability that each possible grounding of each
predicate is true. Across all predicates in IMDB the
skew of true groundings is relatively low (π = 0.06),
but there is significant variation in the skew of
individual predicates. For example, the gender predicate
has a skew close to π = 0.5, whereas a predicate such
as genre has a skew closer to π = 0.05. While presenting
the mean AUCPR across all predicates is a good
first approach, it leads to averaging values that do not
all have the same range. For example, the gender
predicate's range is [0.31, 1.0] while the genre
predicate's range is [0.02, 1.0]. Thus, an AUCPR of 0.4
means very different things on these two predicates.
For the gender predicate, this score is worse than
random guessing, while for the genre predicate this is a
reasonably high score. In a sense, all AUCPR scores of
0.4 are not created equal, but averaging the AUCPR
treats them as equals.

Table 2. Average AUCPR and AUCNPR scores for each
predicate in the IMDB set. Results are for the LSM
algorithm from Kok and Domingos (2010). The range of scores
shows the difficulty and skews of the prediction tasks vary
greatly. By accounting for the (potentially large)
unachievable regions, AUCNPR yields a more conservative overall
estimate of performance.

    Predicate      AUCPR   AUCNPR
    actor          1.000   1.000
    director       1.000   1.000
    gender         0.509   0.325
    genre          0.624   0.611
    movie          0.267   0.141
    workedUnder    1.000   1.000
    MEAN           0.733   0.680


Table 2 shows AUCPR and AUCNPR for each predi- 
cate on a Markov logic network model learned by the 
LSM algorithm (Kok & Domingos, 2010). Notice the 
wide range of scores and that AUCNPR gives a more 
conservative overall estimate. AUCNPR is still sensi- 
tive to skew, so an AUCNPR of 0.4 in the aforemen- 
tioned predicates still does not imply completely com- 
parable performances, but it is closer than AUCPR. 
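
The point that an AUCPR of 0.4 means different things on the two predicates can be made concrete. The sketch below uses the approximate skews quoted in the text (π = 0.5 for gender, π = 0.05 for genre) as assumptions; the helper names are hypothetical.

```python
import math


def aucpr_min(pi):
    """Closed-form minimum AUCPR for skew 0 < pi < 1 (Theorem 2)."""
    return 1 + (1 - pi) * math.log(1 - pi) / pi


def aucnpr(aucpr, pi):
    """AUCPR rescaled so that the minimum PR curve scores 0 and a perfect ranking 1."""
    lo = aucpr_min(pi)
    return (aucpr - lo) / (1 - lo)


score = 0.4
for name, pi in (("gender", 0.5), ("genre", 0.05)):
    beats_random = score > pi   # random guessing has AUCPR equal to the skew
    print(name, round(aucpr_min(pi), 3), round(aucnpr(score, pi), 3), beats_random)
```

The same raw score normalizes to a much smaller value on the less skewed predicate, where it also falls below the random-guessing baseline.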

4.3. Downsampling 

Downsampling is common when learning on highly 
skewed tasks. Often the downsampling alters the skew 
on the train set (e.g., subsampling the negatives to fa- 
cilitate learning, or using data from case-control stud- 
ies) such that it does not reflect the true skew. PR 
analysis is frequently used on the downsampled data 
sets (Sonnenburg et al., 2006; Natarajan et al., 2011; 
Sutskever et al., 2010). The sensitivity of AUCPR and
related scores makes it important to recognize, and if 
possible quantify, the effect of downsampling on eval- 
uation metrics. 

The varying size of the unachievable region provides
an explanation and quantification of some of the
dependence of PR curves and AUCPR on skew. Thus,
AUCNPR, which adjusts for the unachievable region,
should be more stable than AUCPR to changes in
skew. To explore this, we used SAYU (Davis et al.,
2005) to learn a model for the advisedBy task in the
UW-CSE domain for several downsampled train sets.
Table 3 shows the AUCPR and AUCNPR scores on a
test set downsampled to the same skew as the train set
and on the original (i.e., non-downsampled) test set.
AUCNPR has less variance than AUCPR. However,
there is still a sizable difference between the scores
on the downsampled test set and the original test set.
As expected, the difference increases as the ratio
approaches 1 positive to 1 negative. At this ratio, even
the AUCNPR score on the downsampled data is more
than twice the score on the original skew. This is a
massive difference and it is disconcerting that it occurs
simply by changing the data set skew. An intriguing
area for future research is to investigate scoring metrics
that either are less sensitive to skew or permit simple
and accurate transformations that facilitate
comparisons between different skews.

Table 3. AUCPR and AUCNPR scores for SAYU on
UW-CSE advisedBy task for different train set skews. The
downsampled columns report scores on a test set with the
same downsampled skew as the train set. The original skew
columns report scores on the original test set with a ratio
of 1 positive to 24 negatives (π = 0.04).

                 Downsampled         Original Skew
    Ratio      AUCPR   AUCNPR      AUCPR   AUCNPR
    1:1        0.851   0.785       0.330   0.316
    1:2        0.740   0.680       0.329   0.315
    1:3        0.678   0.627       0.343   0.329
    1:4        0.701   0.665       0.314   0.299
    1:5        0.599   0.560       0.334   0.320
    1:10       0.383   0.352       0.258   0.242
    1:24       0.363   0.349       0.363   0.349
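
The downsampled columns of Table 3 can be approximately reproduced from the skew implied by each ratio. This sketch assumes a ratio of 1:k corresponds to π = 1/(1 + k) and applies the normalization to the reported AUCPR; the reported AUCNPR values may have been computed differently (for example, per fold before averaging), so only agreement up to rounding is expected.

```python
import math


def aucpr_min(pi):
    """Closed-form minimum AUCPR for skew 0 < pi < 1 (Theorem 2)."""
    return 1 + (1 - pi) * math.log(1 - pi) / pi


# negatives per positive -> (reported AUCPR, reported AUCNPR), downsampled columns
table3 = {1: (0.851, 0.785), 2: (0.740, 0.680), 3: (0.678, 0.627),
          4: (0.701, 0.665), 5: (0.599, 0.560), 10: (0.383, 0.352),
          24: (0.363, 0.349)}

recomputed = {}
for k, (aucpr, reported) in table3.items():
    lo = aucpr_min(1 / (1 + k))
    recomputed[k] = (aucpr - lo) / (1 - lo)
    print(f"1:{k:<3} min={lo:.3f} recomputed={recomputed[k]:.3f} reported={reported:.3f}")
```

The minimum AUCPR shrinks from roughly 0.31 at 1:1 to roughly 0.02 at 1:24, which accounts for part, but clearly not all, of the gap between the downsampled and original-skew scores.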

4.4. F1 Score

A commonly used evaluation metric for a single point
in PR space is the F_β family,

$$F_\beta = \frac{(1 + \beta^2)\,p\,r}{\beta^2 p + r},$$

where β > 0 is a parameter to control the relative
importance of recall and precision (Manning et al., 2008).
Most frequently, the F1 score (β = 1), which is the
harmonic mean of precision and recall, is used. We focus
our discussion on the F1 score, but similar analysis
applies to F_β. Figure 4 shows contours of the F1 score
over PR space.
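
For reference, the family can be computed as follows. This is a sketch of the standard definition; the function name `f_beta` and the convention of returning 0 when both precision and recall are 0 are assumptions.

```python
def f_beta(p, r, beta=1.0):
    """Weighted harmonic mean of precision p and recall r."""
    if p == 0 and r == 0:
        return 0.0                 # convention chosen here to avoid 0/0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


# beta > 1 weights recall more heavily, beta < 1 weights precision more heavily.
for beta in (0.5, 1.0, 2.0):
    print(beta, round(f_beta(0.3, 0.9, beta), 4))
```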

While the unachievable region of PR space does not
put any bounds on F1 score based on skew, there is
still a subtle interaction between skew and F1. Since
F1 combines precision and recall into a single score,
it necessarily loses information. One aspect of this
information loss is that PR points with the same F1
score can have vastly different relationships with the
unachievable region. Consider points A, B, and C in
Figure 4. All three have an F1 score of 0.45, but each
has a very different interpretation if obtained from a
data set with π = 0.33. Point A is unachievable and no
valid confusion matrix for it exists. Point B is
achievable, but is very near the minimum PR curve and is
only marginally better than random guessing. Point
C has reasonable performance representing good
precision at modest recall.
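
The interaction can be explored by walking along a single F1 contour and testing each point against Eq. (1). The recalls below are hypothetical illustrations and are not the coordinates of points A, B, and C, which are given only graphically in Figure 4; the helper names are also hypothetical.

```python
def precision_on_f1_contour(r, f1):
    """Precision p whose harmonic mean with recall r equals f1 (needs r > f1 / 2)."""
    return f1 * r / (2 * r - f1)


def min_precision(r, pi):
    """Lower bound on precision at recall r for skew pi, from Eq. (1)."""
    return pi * r / (1 - pi + pi * r)


f1, pi = 0.45, 0.33
# Recall at which the F1 contour meets the minimum PR curve (solve the two for r).
crossing = f1 / (pi * (2 - f1))
for r in (0.3, 0.8, 1.0):          # hypothetical recalls, not the paper's A, B, C
    p = precision_on_f1_contour(r, f1)
    status = "achievable" if p >= min_precision(r, pi) else "unachievable"
    print(f"r={r:.2f} p={p:.3f} min_p={min_precision(r, pi):.3f} {status}")
print(f"contour leaves the achievable region at recall {crossing:.3f}")
```

All printed points share the same F1 score, yet they fall on different sides of the minimum PR curve.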

While losing information is inevitable with a summary
like F1, the different interpretations arise partly
because F1 treats recall and precision interchangeably.
Furthermore, this is not unique to β = 1. While β
changes the relative importance, the assumption
remains that precision and recall, appropriately scaled
by β, are equivalent for assessing performance. Our
results on the unachievable region show this is
problematic as recall and precision have fundamentally
different properties. Every recall has a minimum
precision, while there is a maximum recall for low precision,
and no constraints for most levels of precision.

While a modified F1 score that is sensitive to the
unachievable region would be useful, initial work suggests
an ideal solution may not exist. Consider three simple
requirements for a modified F1 score, f':

$$f'(r, p) = 0 \quad \text{if } p = \frac{\pi r}{1 - \pi + \pi r} \tag{3}$$
$$f'(r_1, p) < f'(r_2, p) \quad \text{iff } r_1 < r_2 \tag{4}$$
$$f'(r, p_1) < f'(r, p_2) \quad \text{iff } p_1 < p_2 \tag{5}$$

Eq. (3) ensures f' = 0 if the PR point is on the
minimum PR curve and Eqs. (4) and (5) capture the
expectation that an increase in precision or recall while the
other is constant should always increase f'. However,
these three properties are impossible to satisfy because
they require 0 = f'(0, 0) < f'(0, π) < f'(1, π) = 0.
Relaxing Eqs. (4) and (5) to ≤ makes it possible to
construct an f' that satisfies the requirements but implies
f'(r, p) = 0 if p ≤ π. This seems unsatisfactory
because it ignores all distinctions once the performance
is worse than random guessing. One modified F1 score
that satisfies the relaxed requirements would assign 0
to any PR point worse than random guessing and use
the harmonic mean of recall and (p - π)/(1 - π) (precision
normalized to random guessing) otherwise.
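
One way to realize the relaxed requirements is sketched below. It assumes that "precision normalized to random guessing" means (p - π)/(1 - π), which maps random-guessing precision to 0 and perfect precision to 1; this normalization and the name `modified_f1` are assumptions of the sketch rather than a definition taken from the text.

```python
def modified_f1(r, p, pi):
    """Zero at or below random-guessing precision; otherwise the harmonic mean
    of recall and precision normalized to random guessing (assumed form)."""
    if p <= pi or r <= 0:
        return 0.0
    p_norm = (p - pi) / (1 - pi)   # assumed normalization: pi -> 0, 1 -> 1
    return 2 * r * p_norm / (r + p_norm)


pi = 0.33
for r, p in ((1.0, 0.33), (0.8, 0.30), (0.5, 0.60), (0.5, 0.80), (1.0, 1.0)):
    print(r, p, round(modified_f1(r, p, pi), 4))
```

As the text notes, every point at or below random-guessing precision collapses to the same score, so distinctions among poor classifiers are lost.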

Ultimately, while F1 score or a modified F1 score
can be extremely useful, nuanced analyses must never 
overlook that it is a summary metric, and vital infor- 
mation for interpreting a model's performance may be 
lost in the summarizing. 
Figure 4. Contours of F1 score in PR space with the
minimum PR curve and unachievable region for π = 0.33. The
points A, B, and C all have F1 = 0.45, but lead to
substantially different practical interpretations.

5. Conclusion 

We demonstrate that a region of precision-recall space
is unachievable for any particular ratio of positive to 
negative examples. With the precise characterization 
of this unachievable region given in Theorems 1 and 2, 
we further the understanding of the effects of down- 
sampling and the impact of the minimum PR curve 
on F measure and score aggregation. 